CLAIMS

[Claim(s)]

[Claim 1] A speech recognition method using a continuous-density hidden Markov model,
characterized in that, when a model is created from the speech of two or more speakers, a model
is created for each speaker, the output distributions of all speakers are combined for each
state of the same speech sound and treated as a mixture distribution, a speaker-independent
(unspecified-speaker) model is thereby created, and recognition is performed using that model.
[Claim 2] The speech recognition method of claim 1, characterized in that, in place of said
continuous-density hidden Markov model, a context-dependent phoneme model in which two or more
phoneme models share states is used.

[Claim 3] The speech recognition method of claim 1 or 2, characterized in that, when the model
for each speaker is created from the speech of said two or more speakers, the model is created
using the transfer vector field smoothing method.

[Claim 4] The speech recognition method of any one of claims 1 to 3, characterized in that, in
said obtained speaker-independent phoneme model, the mixture coefficients are re-estimated
using input speech under the constraint that the mixture coefficients applied to the output
distributions obtained from the same speaker take the same value, and the resulting model is
used for recognition.

[Claim 5] The speech recognition method of claim 4, characterized in that a threshold is set
for said mixture coefficients, mixture components whose mixture coefficients have fallen to or
below the threshold as a result of the re-estimation are removed from the model so that the
number of mixture components is reduced, the weights are then redistributed so that the weights
of each mixture output distribution sum to 1, and the model thus obtained is used for
recognition.
DETAILED DESCRIPTION

[Detailed Description of the Invention]
[0001] 

[Industrial Application] This invention relates to speech recognition methods, and in
particular to speech recognition for unspecified speakers (speaker-independent recognition)
and to speaker-adaptive speech recognition that uses a speaker-independent phoneme model as
its initial model.
[0002] 

[Description of the Prior Art] Conventionally, in order to recognize the speech of
unspecified speakers, a speaker-independent phoneme model was created by training the phoneme
model without distinguishing between speakers. This method makes no use of the constraint that
a single utterance comes from one and the same speaker.

[0003] When the speech of two or more speakers is used for phoneme training without
distinguishing the speakers in this way, the distribution of each phoneme spreads out and
overlaps heavily with those of other phonemes, which leads to recognition errors. For example,
Speaker A's /a/ may coincide with Speaker B's /o/, or their distributions may overlap heavily.
Moreover, when the number of mixture components of the phoneme model is increased in order to
improve recognition performance, the amount of computation increases.

[0004] The main object of this invention is therefore to provide a speech recognition method
that achieves a high recognition rate from a small amount of data and reduces the amount of
computation.
[0005] 

[Means for Solving the Problem] This invention is a speech recognition method using a
continuous-density hidden Markov model in which, when a model is created from the speech of
two or more speakers, a model is created for each speaker, the output distributions of all
speakers are combined for each state of the same speech sound and treated as a mixture
distribution, a speaker-independent model is thereby created, and recognition is performed
using that model.

[0006] More preferably, in place of the continuous-density hidden Markov model, a
context-dependent phoneme model in which two or more phoneme models share states is used.
[0007] More preferably still, when the model for each speaker is created from the speech of
the two or more speakers, the model is created using the transfer vector field smoothing
method.

[0008] More preferably still, in the speaker-independent phoneme model thus obtained, the
mixture coefficients are re-estimated using input speech under the constraint that the mixture
coefficients applied to the output distributions obtained from the same speaker take the same
value, and the resulting model is used for recognition.

[0009] More preferably still, a threshold is set for the mixture coefficients, and mixture
components whose mixture coefficients have fallen to or below the threshold as a result of the
re-estimation are removed from the model, which reduces the number of mixture components. The
weights are then redistributed so that the weights of each mixture output distribution sum to
1, and the model thus obtained is used for recognition.
[0010] 

[Function] The speech recognition method of this invention creates a model for each speaker,
combines the output distributions of all speakers for each state of the same phoneme, and
treats the result as a mixture distribution. Because the constraint of speaker consistency is
used when the speech model is created, confusion of phonemes between speakers can be
prevented. Because only the probability weights are adjusted, speaker adaptation can be
performed with very little input speech, and the amount of computation at recognition time can
be reduced.
[0011] 

[Example] Drawing 1 is a schematic block diagram of one embodiment of this invention. The
speech recognition method of this invention is used, for example, in an automatic translation
telephone. As shown in Drawing 1, the apparatus consists of an amplifier 1, a low-pass filter
2, an A/D converter 3, and a processor 4. The amplifier 1 amplifies the input speech signal,
and the low-pass filter 2 removes aliasing noise from the amplified signal. The A/D converter
3 converts the speech signal into a 16-bit digital signal at a sampling rate of 12 kHz. The
processor 4 contains a computer 5, a magnetic disk 6, terminals 7, and a printer 8. The
computer 5 performs speech recognition on the digital speech signal input from the A/D
converter 3, following the procedure stored on the magnetic disk 6.
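As an illustration of the digitization stage only, the following minimal sketch quantizes a signal to 16-bit samples at 12 kHz. The function name `digitize`, the normalization of the input to the range -1.0 to 1.0, and the clipping behavior are assumptions made for this example; the source specifies only the 16-bit word length and the 12 kHz sampling rate.

```python
import math
from typing import Callable, List

SAMPLE_RATE_HZ = 12_000   # sampling rate stated in the description
BITS_PER_SAMPLE = 16      # word length stated in the description


def digitize(signal: Callable[[float], float], duration_s: float) -> List[int]:
    """Sample `signal` (time in seconds -> amplitude in [-1, 1]) into 16-bit integers.

    The [-1, 1] input range and the clipping are assumptions of this sketch.
    """
    n_samples = int(round(duration_s * SAMPLE_RATE_HZ))
    full_scale = 2 ** (BITS_PER_SAMPLE - 1) - 1  # 32767
    samples = []
    for k in range(n_samples):
        value = signal(k / SAMPLE_RATE_HZ)
        value = max(-1.0, min(1.0, value))       # clip out-of-range input
        samples.append(int(round(value * full_scale)))
    return samples


# Hypothetical usage: 10 ms of a 440 Hz tone.
tone = digitize(lambda t: math.sin(2.0 * math.pi * 440.0 * t), 0.01)
```

At this rate and word length, one second of speech occupies 12,000 samples of 2 bytes each.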

[0012] Drawing 2 is a flow chart explaining the operation of one embodiment of this
invention. The operation is described next with reference to Drawing 1 and Drawing 2. In the
method of this invention, a speaker-independent phoneme model is first created by speaker
mixing, speaker adaptation is then performed by speaker weight learning, and speaker pruning
is performed after that.

[0013] First, in creating the speaker-independent phoneme model by speaker mixing, the
phoneme models created for each speaker are used as the mixture components of the
speaker-independent phoneme model. In step SP1 of Drawing 2 (steps are abbreviated SP in the
drawing), an initial HMnet whose output distributions are single Gaussian distributions is
generated from a large amount of data from one speaker stored on the magnetic disk 6, using
the Successive State Splitting (SSS) algorithm. For HMnet and the SSS algorithm, see Takami
and Sagayama, "Automatic generation of hidden Markov networks by successive state splitting
with respect to phoneme context and time," Institute of Electronics, Information and
Communication Engineers, speech study group technical report SP91-88 (December 1991). This
model can be used as it is for speaker-dependent speech recognition.

[0014] Next, parameter training is performed. After the SSS algorithm has determined the
structure of the HMnet in step SP1, the HMnet parameters for each speaker are obtained in step
SP2 from comparatively small amounts of speech data from two or more speakers stored on the
magnetic disk 6. Transfer vector field smoothing (Vector Field Smoothing: VFS) is used to
train the parameters. For VFS, see Okura, Sugiyama, and Sagayama, "Speaker adaptation by
transfer vector field smoothing using mixture continuous-density HMMs," Institute of
Electronics, Information and Communication Engineers, speech study group technical report
SP92-16, pp. 23-28 (June 1992). In this way, by preparing speech data from two or more
speakers and training the HMnet parameters with the VFS method for each speaker, HMnets
adapted to each of the speakers are generated.

[0015] Next, speaker mixing is performed. As shown in step SP3, a mixture continuous output
distribution HMnet is created by representing the corresponding states of the HMnets for the
two or more speakers as a single mixture output distribution. Because the HMnets have the same
structure, speaker mixing is performed by gathering into one the output distributions of the
states at the same position in the structure and expressing them as a mixture continuous
output distribution. The branch probabilities (mixture weights) are either set to equal
probabilities or determined by re-estimating only the branch probabilities with the Baum-Welch
algorithm.
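The speaker mixing step can be sketched as follows, for the equal-probability case only. This is a minimal illustration and not the patent's implementation: the class names `Gaussian` and `MixtureState`, the function `mix_speakers`, the representation of an HMnet as a flat list of states, and the use of diagonal variances are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Gaussian:
    speaker: str        # speaker whose data produced this component
    mean: List[float]
    var: List[float]    # diagonal variances (an assumption of this sketch)


@dataclass
class MixtureState:
    components: List[Gaussian]
    weights: List[float]  # branch probabilities; they sum to 1


def mix_speakers(models: Dict[str, List[Gaussian]]) -> List[MixtureState]:
    """Merge per-speaker single-Gaussian states into one mixture per state.

    `models` maps a speaker name to that speaker's list of states. All
    speakers must share the same structure, so state i of every speaker
    describes the same position in the network.
    """
    speakers = sorted(models)
    n_states = len(models[speakers[0]])
    if any(len(models[s]) != n_states for s in speakers):
        raise ValueError("all speaker models must share one structure")
    mixed = []
    for i in range(n_states):
        components = [models[s][i] for s in speakers]
        equal = 1.0 / len(components)  # equal branch probabilities
        mixed.append(MixtureState(components, [equal] * len(components)))
    return mixed
```

Each component keeps its `speaker` label, so it remains known which speaker every mixture component came from.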

[0016] Drawing 3 is a conceptual diagram of speaker mixing, speaker weight learning, and
speaker pruning. Drawing 3 (a) illustrates the speaker mixing described above and shows the
output distributions of Speakers A, B, and C in State i and State j.
[0017] Next, speaker adaptation by speaker weight learning is described. Taking the mixture
continuous distribution HMnet obtained as described above as the base, a technique for
performing speaker adaptation with a small amount of input speech is explained. In the mixture
continuous output distribution HMnet created by the speaker-mixing SSS, it is known from which
speaker's data each mixture component of each mixture output distribution was generated. The
branch probability of each mixture component can therefore be interpreted as a weighting
factor for the corresponding speaker. For this reason, the branch probabilities of the mixture
components originating from the same speaker, i.e., the speaker weighting factors, can be
treated as "tied." Using this property of the speaker-mixing SSS, speaker adaptation by
speaker weight learning is performed as shown in step SP4. The means, variances, and
transition probabilities of the output distributions are not updated. Only the weighting
factors are updated, using the Baum-Welch algorithm under the tying constraint, i.e., the
constraint that the weights of the mixture components originating from the same speaker are
the same. Concatenated training is used for this learning. The concept of the speaker weight
learning described above is shown in Drawing 3 (b). Speech recognition is then performed using
the model adapted to the input speaker's speech.
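The tied weight update can be illustrated with a deliberately simplified sketch. It assumes that each adaptation frame is already assigned to a state, which replaces the state occupation probabilities that the Baum-Welch algorithm would compute, and it assumes diagonal-variance Gaussians. The names `log_gauss` and `learn_speaker_weights` and the data layout are hypothetical. Under these assumptions the update is an EM re-estimation of one weight per speaker, shared across all states, with the means and variances left fixed.

```python
import math
from typing import Dict, List, Sequence, Tuple

# (mean, diagonal variance) of one speaker's output distribution in one state
Gaussian = Tuple[Sequence[float], Sequence[float]]


def log_gauss(x: Sequence[float], mean: Sequence[float], var: Sequence[float]) -> float:
    """Log density of a diagonal-variance Gaussian."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def learn_speaker_weights(
    states: List[Dict[str, Gaussian]],
    frames: List[Tuple[int, Sequence[float]]],
    n_iter: int = 10,
) -> Dict[str, float]:
    """Re-estimate one tied weight per speaker; Gaussians stay fixed.

    states[i][s] is speaker s's Gaussian in state i.
    frames is a list of (state index, feature vector); the known state
    index is an assumption standing in for Baum-Welch state posteriors.
    """
    if not frames:
        raise ValueError("need at least one adaptation frame")
    speakers = sorted(states[0])
    weights = {s: 1.0 / len(speakers) for s in speakers}
    for _ in range(n_iter):
        counts = {s: 0.0 for s in speakers}
        for state_index, x in frames:
            # E-step: posterior probability of each speaker's component
            logp = {
                s: math.log(weights[s]) + log_gauss(x, *states[state_index][s])
                for s in speakers
                if weights[s] > 0.0
            }
            top = max(logp.values())
            norm = sum(math.exp(v - top) for v in logp.values())
            for s, v in logp.items():
                counts[s] += math.exp(v - top) / norm
        # M-step: the same weight is used for a speaker in every state
        weights = {s: counts[s] / len(frames) for s in speakers}
    return weights
```

Because only one weight per speaker is estimated, the number of free parameters equals the number of speakers, which is consistent with the description's point that adaptation works with very little input speech.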

[0018] Next, speaker pruning is described. In step SP5, any weighting factor of a mixture
output distribution of the HMnet that has fallen to or below a preset probability as a result
of speaker weight learning is replaced by 0. The weights are then redistributed so that the
weights of each mixture output distribution sum to 1. The principle is shown in Drawing 3 (c).
As shown there, the model is simplified by deleting all mixture components that originate from
a speaker whose speaker weight is small. Speech recognition is then performed using the
phoneme model whose size has been reduced in this way.
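Speaker pruning as described above amounts to thresholding the speaker weights and renormalizing the survivors. The following is a minimal sketch; the function name `prune_speakers`, the dictionary representation of the tied weights, and the example threshold are assumptions for illustration.

```python
from typing import Dict


def prune_speakers(weights: Dict[str, float], threshold: float) -> Dict[str, float]:
    """Drop speakers whose weight is at or below `threshold`, then renormalize.

    Removing a speaker here stands for deleting every mixture component
    that originates from that speaker, in all states.
    """
    kept = {s: w for s, w in weights.items() if w > threshold}
    if not kept:
        raise ValueError("threshold removes every speaker")
    total = sum(kept.values())
    # Redistribute so that the remaining weights sum to 1
    return {s: w / total for s, w in kept.items()}


# Hypothetical usage: speaker C falls below the threshold and is removed.
pruned = prune_speakers({"A": 0.60, "B": 0.38, "C": 0.02}, threshold=0.05)
```

With fewer components per state, fewer Gaussians have to be evaluated for each frame at recognition time.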

[0019] Drawing 4 shows the results of Japanese phrase recognition experiments for three
methods according to one embodiment of this invention: speaker adaptation by speaker weight
learning, speaker weight learning plus speaker pruning, and learning of all weights. The
speaker adaptation results show that, for every speaker, the recognition rate improves with
very few samples, on the order of 1 to 5 words. With the conventional method of learning all
the weights independently, there are many parameters to train, so the recognition rate
actually falls when few training words are available.
[0020] Table 1 shows the change in the number of mixture components when speaker pruning is
performed. Although the number of output distributions falls to about 1/2 to 1/12 depending on
the speaker, no notable decline in the recognition rate is observed.
[0021] 
[Table 1]
[0022]

[Effect of the Invention] As described above, according to this invention, in a speech
recognition method using a continuous-density hidden Markov model, when a model is created
from the speech of two or more speakers, a model is created for each speaker, the output
distributions of all speakers are combined for each state of the same phoneme and treated as a
mixture distribution, and a speaker-independent model is thereby created and used for
recognition. As a result, a high recognition rate can be obtained from a small amount of data.






http://www4.ipdl.ncipi.go.jp/cgi-bin/tran_web_cgi^ejue 



2005/03/23 



JP.07-069711.B [DESCRIPTION OF DRAWINGS]



* NOTICES *

JPO and NCIPI are not responsible for any
damages caused by the use of this translation.

1. This document has been translated by computer. So the translation may not reflect the original
precisely.
2. **** shows the word which can not be translated.
3. In the drawings, any words are not translated.



DESCRIPTION OF DRAWINGS 



[Brief Description of the Drawings] 

[Drawing 1] A schematic block diagram of one embodiment of this invention.
[Drawing 2] A flow chart explaining the operation of one embodiment of this invention.
[Drawing 3] A conceptual diagram of speaker mixing, speaker weight learning, and speaker
pruning according to this invention.
[Drawing 4] A drawing showing the phrase recognition experiment results after speaker
adaptation according to this invention.
[Description of Notations] 

1 Amplifier 

2 Low Pass Filter 

3 A/D Converter 

4 Processor 

5 Computer 

6 Magnetic Disk 

7 Terminals 

8 Printer 
DRAWINGS



[Drawing 3] 




[Drawing 2] 